## [1] "wine_id" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Introduction: The red wine data set provided by Udacity is described as including 1,599 red wines with 11+ variables. Working from the CSV file, I added wine_id in case the analysis needed a unique identifier for each wine. This analysis looks at the chemical properties of the wines and how they influence quality. According to the supporting documentation, quality was decided by wine experts based on a sample of these wines (https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt). The analysis will focus on a few of the variables that I predict will have an impact on wine quality. The documentation on the collection of the data describes “volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste,” so I would expect quality to go down as volatile acidity rises. I would also expect wines with a more acidic pH to be lower on the quality scale.
First, I would like to look at a summary of the dataset.
## wine_id fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## 'data.frame': 1599 obs. of 14 variables:
## $ wine_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.f : Factor w/ 7 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
Looking at the structure of the data, quality is an int type. We will need it as a factor so that it can be used as categorical data later in our charts.
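A minimal sketch of that conversion, using only the first ten `quality` values shown in the `str()` output above. The data frame name `wine_head` is hypothetical, and the code actually used for this report is not shown, so treat this as one plausible way to do it.

```r
# First ten quality scores, copied from the str() output above.
# wine_head is a hypothetical name for this small excerpt.
wine_head <- data.frame(
  wine_id = 1:10,
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Convert the integer score to a factor so plots treat it as categorical.
wine_head$quality.f <- factor(wine_head$quality)

str(wine_head$quality.f)
levels(wine_head$quality.f)
```

On this excerpt only the scores 5, 6 and 7 occur, so the factor has three levels; the full data set reports seven.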
Quality is the first variable to look at and the one of most interest. It ranges from 3 to 8.
##Univariate Plots Section
This histogram shows the count of wines at each quality score. It appears that the majority of the red wines are a 5 in quality, but let’s look at a summary of that information as well.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
This table shows that 681 wines are rated 5 and 638 are rated 6. Most wines fall in the middle of the quality ratings.
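The counts in the table are enough to check that claim. The sketch below rebuilds the quality vector from the six counts (the variable names are my own) and computes the share of wines rated 5 or 6, along with the mean score for comparison with the summary above.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the full quality vector from the counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines      <- length(quality)             # total number of wines
share_mid    <- mean(quality %in% c(5, 6))  # share rated 5 or 6
mean_quality <- mean(quality)               # compare with the summary() mean

n_wines
round(share_mid, 3)
round(mean_quality, 3)
```

The rebuilt vector has 1,599 entries, roughly 82% of them are rated 5 or 6, and the mean agrees with the 5.636 reported in the summary.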
This histogram of pH has a fairly normal distribution shape.
This histogram has two peaks, one around 0.4 and the other around 0.6.
This histogram is skewed to the left.
Citric acid has a bimodal histogram.
This histogram has a high spike around 2 and is positively skewed. Most of the wines appear to have residual sugar of less than 4.
Most of the results are clustered between 0 and 0.1; however, a few fall well outside that range. This leads me to believe that chlorides will have no correlation to quality rating.
This histogram is skewed right as well, with a few outliers beyond 60.
Skewed right again with a few closer to 300.
Density has a normal distribution on this histogram.
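As a rough cross-check on the shapes described above, the mean and median from the summary output can be compared: a mean well above the median hints at a right tail. This is only a heuristic, not a formal skewness measure, and the name `center_stats` is my own.

```r
# Mean and median per variable, copied from the summary() output above.
center_stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH"),
  median   = c(2.200, 0.07900, 14.00, 38.00, 0.9968, 3.310),
  mean     = c(2.539, 0.08747, 15.87, 46.47, 0.9967, 3.311)
)

# Relative gap between mean and median; clearly positive values
# suggest a right-skewed distribution.
center_stats$rel_gap <- (center_stats$mean - center_stats$median) / center_stats$median

# Largest gaps first.
center_stats[order(-center_stats$rel_gap), ]
```

Total sulfur dioxide, residual sugar, free sulfur dioxide and chlorides show clearly positive gaps, while density and pH are close to zero, which is consistent with their roughly symmetric histograms.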
##Univariate Analysis:
When looking at the red wine data in the univariate analysis the main focus is to look at the variables and their distribution in the wines in the dataset. The main one of interest is quality, but it is important to look at the other variables and see if we can start to see which variables will most heavily impact the quality rating of the wines.
I would think that acidity will cause a wine to be rated lower, so I will be looking at the pH, volatile acidity and the fixed acidity. I would also anticipate that wines with lots of residual sugars will be rated lower due to their sweetness.
I added wine_id to my dataset in case we need a unique identifier for the wine. I have not performed any functions to tidy or adjust the data.
I also did some additional reading on wine and learned that sulfur dioxide is added to wine. The article discussed that this may have an impact on the flavor, so I will be looking at the sulphates and sulfur dioxide variables as well.
##Bivariate Plots Section
In this next section, I would like to look at the relationship between acidity, sugar, alcohol and quality.
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
This shows a medium negative correlation between quality and volatile acidity.
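To make the test output above easier to read, here is a short sketch of how the reported t statistic and confidence interval follow from the correlation coefficient and the sample size. It uses only the numbers printed above: the t statistic is r·sqrt(df / (1 − r²)), and the interval comes from the Fisher z-transformation, which is the method `cor.test()` uses for Pearson correlations.

```r
r  <- -0.3905578  # reported correlation between quality and volatile.acidity
n  <- 1599        # number of wines
df <- n - 2       # degrees of freedom, 1597 as reported

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results agree with the reported t = -16.954 and the interval from about -0.431 to -0.348.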
## `geom_smooth()` using formula 'y ~ x'
Scatter plot of quality and volatile acidity, showing a medium negative correlation.
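A scatter plot of this kind could be sketched as follows. The plotting code used for this report is not shown, so the jitter, the linear smoother and the axis labels are assumptions, and `wine_head` is a hypothetical data frame holding only the first ten rows from the `str()` output above. The plot is built only if ggplot2 is installed.

```r
# First ten rows of the two variables, copied from the str() output above.
wine_head <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(wine_head, aes(x = quality, y = volatile.acidity)) +
    # Quality is an integer score, so jitter horizontally to reduce overplotting.
    geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
    # Linear trend line (assumed smoother; the formula is stated explicitly).
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = "Quality score", y = "Volatile acidity")
  # print(p) draws the plot.
}
```

With the full data set in place of `wine_head`, the same layers apply unchanged.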
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
## `geom_smooth()` using formula 'y ~ x'
These graphs and the correlation show a small positive correlation between quality and fixed acidity.
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Rounded to 0.5, this indicates a large positive correlation between alcohol and quality.
## `geom_smooth()` using formula 'y ~ x'
This shows a large positive correlation between alcohol and quality. It appears that higher quality wines have a higher alcohol content.
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## `geom_smooth()` using formula 'y ~ x'
##
## Pearson's product-moment correlation
##
## data: quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
## `geom_smooth()` using formula 'y ~ x'
pH seems to have very little correlation with quality rating (r is about -0.06).
##
## Pearson's product-moment correlation
##
## data: quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
## `geom_smooth()` using formula 'y ~ x'
##
## Pearson's product-moment correlation
##
## data: quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
## `geom_smooth()` using formula 'y ~ x'
##
## Pearson's product-moment correlation
##
## data: density and fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
## `geom_smooth()` using formula 'y ~ x'
As density increases, fixed acidity increases as well. This is a strong correlation, but not one that I would have expected. I would have expected a stronger correlation between residual sugar and density.
##Bivariate Analysis
Alcohol content and quality have a positive correlation, and volatile acidity and quality have a negative correlation. These two factors seem to impact the quality rating of the wines.
There appears to be a strong correlation between density and fixed acidity. Looking at the data, I would have thought that residual sugar would have a stronger correlation with density than fixed acidity does.
The strongest relationship with quality was alcohol content.
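The ranking behind that statement can be reproduced from the coefficients reported in the bivariate section. The sketch below collects them in a named vector (the names `cor_with_quality` and `ranked` are my own) and sorts by absolute value.

```r
# Pearson correlations with quality, copied from the cor.test() output above.
cor_with_quality <- c(
  volatile.acidity = -0.3905578,
  fixed.acidity    =  0.1240516,
  alcohol          =  0.4761663,
  sulphates        =  0.2513971,
  pH               = -0.05773139,
  residual.sugar   =  0.01373164,
  density          = -0.1749192
)

# Sort by absolute value so the strongest relationships come first.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)
```

Alcohol comes first, followed by volatile acidity and sulphates; residual sugar and pH are at the bottom.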
##Multivariate Plots Section
## Warning: Removed 43 rows containing missing values (geom_point).
This shows that most of the higher quality wines have a lower volatile acidity level. Most of the wines have a pH between 3 and 3.5, so pH doesn’t seem to have as large an impact on the quality rating.
This chart shows that lower volatile acidity and higher alcohol content are present in the higher quality wines.
## Warning: Removed 75 rows containing missing values (geom_point).
## Warning: Removed 58 rows containing missing values (geom_point).
The chart shows that wines in the higher quality ratings have a lower volatile acidity and a higher sulphate content.
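The "Removed N rows" warnings in this section are what ggplot2 prints when points fall outside the limits set on a scale. The summary output shows no missing values, so trimmed axis limits are one plausible cause, though the plotting code is not shown. The sketch below illustrates the mechanism on the first ten rows from the `str()` output; the limits, the data frame name `wine_head` and the aesthetic mapping are all assumptions.

```r
# First ten rows of three variables, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality.f        = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Hypothetical axis limits; count the rows that fall outside them.
va_limits <- c(0.2, 0.8)
n_dropped <- sum(wine_head$volatile.acidity < va_limits[1] |
                 wine_head$volatile.acidity > va_limits[2])
n_dropped

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(wine_head, aes(x = alcohol, y = volatile.acidity, colour = quality.f)) +
    geom_point() +
    # Values outside the limits are set to NA and dropped with a warning when drawn.
    scale_y_continuous(limits = va_limits) +
    labs(x = "Alcohol", y = "Volatile acidity", colour = "Quality")
  # print(p) draws the plot and emits the "Removed ... rows" warning.
}
```

If the goal is to zoom without dropping data, `coord_cartesian(ylim = va_limits)` keeps every row and only changes the visible window.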
##Multivariate Analysis
These plots were useful for refining the relationships seen in the other charts. Wines in the higher quality ratings have a lower volatile acidity and a higher sulphate content. Wines with lower volatile acidity and higher alcohol content have a higher quality rating.
##Final Plots and Summary
Plot One
This is the overall distribution of wine quality for the red wine data set. As you can see, the majority of wines fall into the ratings of 5 and 6.
Plot Two
This chart compares volatile acidity, alcohol content and quality rating for the red wine data set. It shows some clustering of the higher quality wines in the area of lower volatile acidity and higher alcohol content.
Plot Three
## Warning: Removed 58 rows containing missing values (geom_point).
This chart compares volatile acidity, sulphate content and quality rating for the red wine data set. It shows some clustering of the higher quality wines in the area of lower volatile acidity and higher sulphates.
##Reflection
In analyzing this data, I found that volatile acidity has a negative impact on wine rating. This did not come as a surprise, since my pre-reading for this project mentioned that higher volatile acidity can make wine taste more like vinegar. It was a surprise that alcohol content had the strongest positive correlation with the quality rating; I would have thought that alcohol content would have no impact on the flavor and therefore no impact on the rating. The correlation between sulphates and the wine rating was the most interesting to me, since I did not go into this project thinking sulphates would impact the rating. I think more data about storage temperature, types of grapes, types of barrels, and added ingredients would have been interesting to explore as well. I did struggle with the subjective nature of the wine ratings. I feel that this data, coupled with other data points, could possibly eliminate the need for the human element and thus the subjective rating.
One thing that was very helpful for me in this course was working with my project data set alongside each lesson's data set as I went through the lessons. This helped me start thinking about the data early in the course.
References:
http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
https://www.torres.es/en/blog/how-wine-made/4-factors-determine-wine-quality#
https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/index.html